Introduction

The development of technologies for exploring the solar system has revived interest in the computational
simulation of chemically reacting flows, since planetary probe vehicles exhibit non-equilibrium phenomena
during atmospheric entry at a planet or a moon as well as during reentry to the Earth. Stability in
combustion is also essential for new propulsion systems. Numerical solution of real-gas flows often increases
computational work by an order of magnitude compared to perfect-gas flow, partly because of the increased
complexity of the equations to be solved. Recently, as part of Project Columbia, NASA has integrated a cluster of
interconnected SGI Altix systems to provide a ten-fold increase in current supercomputing capacity, which
includes an SGI Origin system. Both the new and existing machines are based on a cache-coherent
non-uniform memory access architecture.

The Lower-Upper Symmetric Gauss-Seidel (LU-SGS) relaxation method 1 has been implemented in both
perfect and real gas flow codes 2 ** including the Real-Gas Aerodynamic Simulator (RGAS) 9. However, the
vectorized RGAS code runs inefficiently on cache-based shared-memory machines such as SGI systems.
Parallelization of a Gauss-Seidel method is nontrivial due to its sequential nature.

The LU-SGS method has been vectorized on an oblique plane in the INS3D-LU code 4, which has been one of the
base codes for the NAS Parallel Benchmarks 10. The oblique plane is called a hyperplane by computer
scientists. It is straightforward to parallelize a Gauss-Seidel method by partitioning the hyperplanes once
they are formed. Another way of parallelization is to schedule the processors like a pipeline in software.
Both hyperplane and pipeline methods have been implemented using OpenMP directives. The present paper
reports the performance of the parallelized RGAS code on SGI Origin and Altix systems.


Numerical Methods 

Let $t$ be time; $Q$ the vector of conserved variables; $E$, $F$, and $G$ the convective flux vectors; and $E_v$, $F_v$, and
$G_v$ the flux vectors for the viscous terms. The source term $S$ represents production or destruction of
species due to chemical reactions. The three-dimensional Navier-Stokes and species transport equations in
generalized curvilinear coordinates $(\xi, \eta, \zeta)$ can be written as

$$ \partial_t Q + \partial_\xi (E - E_v) + \partial_\eta (F - F_v) + \partial_\zeta (G - G_v) = S \tag{1} $$

The governing equations are integrated in time for both steady and unsteady flow calculations. For
steady-state solutions, the parameter $\alpha$ in Eq. (2) is set to 1. An unfactored implicit scheme can be
obtained from a nonlinear implicit scheme by linearizing the flux vectors about the previous time step and
dropping terms of second and higher orders.

$$ [I + \alpha \Delta t (D_\xi A + D_\eta B + D_\zeta C - H)] \Delta Q = \mathrm{RHS} \tag{2} $$

where

$$ \mathrm{RHS} = -\Delta t [D_\xi (E - E_v) + D_\eta (F - F_v) + D_\zeta (G - G_v) - S] \tag{3} $$

$I$ is the identity matrix and $\Delta Q$ denotes the correction. $A$, $B$, $C$, and $H$ are the Jacobian matrices of the
convective flux vectors and the source term, respectively. Artificial dissipation models augment a
piecewise-constant cell-centered finite-volume formulation of the right-hand side 5.
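
The linearization step can be written out explicitly. The block below is a sketch using the standard Taylor expansion; the time-level superscript $n$ and the residual symbol $R$ are notation assumed here, not taken from the paper, and the viscous Jacobians are assumed to be neglected on the implicit side, as the form of Eq. (2) suggests.

```latex
% Assumed nonlinear implicit scheme, with R(Q) the bracketed spatial residual of Eq. (3):
%   R(Q) = D_\xi (E - E_v) + D_\eta (F - F_v) + D_\zeta (G - G_v) - S
\Delta Q = -\Delta t \left[ \alpha\, R(Q^{n+1}) + (1 - \alpha)\, R(Q^{n}) \right],
\qquad \Delta Q = Q^{n+1} - Q^{n}

% Linearize each flux vector and the source term about time level n:
E^{n+1} = E^{n} + A\,\Delta Q + O(\|\Delta Q\|^{2}), \qquad A = \frac{\partial E}{\partial Q},
\qquad
S^{n+1} = S^{n} + H\,\Delta Q + O(\|\Delta Q\|^{2}), \qquad H = \frac{\partial S}{\partial Q}

% Dropping the second- and higher-order terms gives the unfactored scheme of Eq. (2):
\left[ I + \alpha \Delta t \left( D_\xi A + D_\eta B + D_\zeta C - H \right) \right] \Delta Q
  = -\Delta t\, R(Q^{n})
```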

Direct inversion of a large block banded matrix becomes impractical in three dimensions because of the
rapid increase of computational work and the large storage requirement. The LU-SGS scheme is one of the
approximate factorization methods that alleviate these difficulties in three dimensions. Let subscripts $f$ and $s$
indicate the fluid and species transport equations, respectively. The loosely-coupled method solves the
Navier-Stokes and species transport equations separately, but the solutions are updated simultaneously at
each time step.

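$$ L D^{-1} U \Delta Q = \mathrm{RHS} \tag{5a} $$

For the fluid equations,

$$ L_f D_f^{-1} U_f \Delta Q_f = \mathrm{RHS}_f \tag{7} $$

where

$$ L_f = I + \alpha \Delta t (D_\xi^- A_f^+ + D_\eta^- B_f^+ + D_\zeta^- C_f^+ - A_f^- - B_f^- - C_f^-) $$

$$ D_f = I + \alpha \Delta t (A_f^+ + B_f^+ + C_f^+ - A_f^- - B_f^- - C_f^-) $$

$$ U_f = I + \alpha \Delta t (D_\xi^+ A_f^- + D_\eta^+ B_f^- + D_\zeta^+ C_f^- + A_f^+ + B_f^+ + C_f^+) \tag{7b} $$

For the species transport equations,

$$ L_s D_s^{-1} U_s \Delta Q_s = \mathrm{RHS}_s \tag{8} $$

where

$$ L_s = I + \alpha \Delta t (D_\xi^- A_s^+ + D_\eta^- B_s^+ + D_\zeta^- C_s^+ - A_s^- - B_s^- - C_s^- - H) $$

$$ D_s = I + \alpha \Delta t (A_s^+ + B_s^+ + C_s^+ - A_s^- - B_s^- - C_s^-) $$

$$ U_s = I + \alpha \Delta t (D_\xi^+ A_s^- + D_\eta^+ B_s^- + D_\zeta^+ C_s^- + A_s^+ + B_s^+ + C_s^+) \tag{8b} $$
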


The loosely-coupled partially-implicit scheme includes the source Jacobian term $H$ only in the $L_s$ factor.
Solving the equations in a loosely-coupled manner ignores cross-coupling terms in the Jacobian matrix $A$
such as $\partial E_f / \partial Q_s$ and $\partial E_s / \partial Q_f$.
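
The two-sweep structure of an LU-SGS solve, and the reason it is sequential, can be illustrated with a scalar one-dimensional model. This is a minimal sketch and not the RGAS implementation: the function names, the unit grid spacing, the zero correction assumed outside the domain, and the scalar stand-ins `a_plus >= 0` and `a_minus <= 0` for the split Jacobians are all assumptions made for illustration.

```python
def lu_sgs_sweeps(rhs, a_plus, a_minus, alpha=1.0, dt=0.5):
    """Solve L D^{-1} U x = rhs for a scalar 1-D model with unit grid spacing.

    a_plus >= 0 and a_minus <= 0 are hypothetical scalar stand-ins for the
    split Jacobians A+ and A-. The correction is taken as zero outside the grid.
    """
    n = len(rhs)
    c = alpha * dt
    d = 1.0 + c * (a_plus - a_minus)  # common diagonal of L, D and U

    # Forward sweep, L y = rhs: point i needs the already-updated y[i-1].
    y = [0.0] * n
    for i in range(n):
        left = y[i - 1] if i > 0 else 0.0
        y[i] = (rhs[i] + c * a_plus * left) / d

    # Backward sweep, U x = D y: point i needs the already-updated x[i+1].
    x = [0.0] * n
    for i in reversed(range(n)):
        right = x[i + 1] if i < n - 1 else 0.0
        x[i] = y[i] - c * a_minus * right / d
    return x


def apply_factored_operator(x, a_plus, a_minus, alpha=1.0, dt=0.5):
    """Return L D^{-1} U x; used only to check the sweeps above."""
    n = len(x)
    c = alpha * dt
    d = 1.0 + c * (a_plus - a_minus)
    ux = [d * x[i] + c * a_minus * (x[i + 1] if i < n - 1 else 0.0) for i in range(n)]
    z = [v / d for v in ux]
    return [d * z[i] - c * a_plus * (z[i - 1] if i > 0 else 0.0) for i in range(n)]
```

Each sweep reads a neighbour that was updated earlier in the same sweep. That loop-carried dependence is what prevents a straightforward parallel loop and motivates the hyperplane and pipeline orderings.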

Parallelization methods 

The original vector code ran inefficiently on cache-based systems. First, manual optimization that included
array changes greatly enhanced the performance of the serial code. The LU-SGS scheme in the code was
vectorized on a hyperplane where i+j+k=const. The key element was the conversion of three-dimensional
indices (i, j, k) to two-dimensional ones (ipoint, iplane) 4.
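
The index conversion can be sketched as follows. This is an illustrative construction, not the code used in RGAS: the function names and the zero-based indexing are assumptions, and only the names `ipoint` and `iplane` come from the text.

```python
def build_hyperplanes(ni, nj, nk):
    """Group zero-based grid points (i, j, k) by hyperplane index i + j + k."""
    planes = [[] for _ in range(ni + nj + nk - 2)]
    for k in range(nk):
        for j in range(nj):
            for i in range(ni):
                planes[i + j + k].append((i, j, k))
    return planes


def index_map(planes):
    """Map each (i, j, k) to its two-dimensional (ipoint, iplane) pair."""
    return {
        ijk: (ipoint, iplane)
        for iplane, points in enumerate(planes)
        for ipoint, ijk in enumerate(points)
    }
```

In a forward sweep, point (i, j, k) depends on (i-1, j, k), (i, j-1, k), and (i, j, k-1), all of which lie on hyperplane `iplane - 1`. The points within one hyperplane are therefore mutually independent and can be vectorized or shared among threads, while the hyperplanes themselves must be visited in order.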

Once the hyperplane is formed, it is straightforward to parallelize the algorithm by partitioning the plane.
The method has the limitation that parallelism is restricted to points within one hyperplane. In order to
improve memory access and to alleviate communication-related problems, the code has been converted
manually to use a canonical ordering. The restructured code improves the serial performance by a factor of
two, already a significant speed-up on its own. Then the processors are scheduled like a pipeline at the
outermost loop level. Sequential operations in each processor are performed in cache. This approach
exploits partial parallelism in loops that carry dependencies.
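
The pipelined schedule can be illustrated with a small simulation. The decomposition below is an assumption made for the sketch, since the text does not specify it: each thread owns a contiguous block of j-lines, k is the outermost sweep loop, and thread p may start plane k only after thread p-1 has finished plane k. No real threads are created; the code only enumerates which (thread, plane) tasks could run at the same time step.

```python
def pipeline_schedule(nk, nthreads):
    """Return, per time step, the (thread, k) tasks that run concurrently.

    Hypothetical decomposition: thread p owns a block of j-lines and k is the
    outermost loop, so thread p works on plane k at step t = k + p.
    """
    steps = []
    for t in range(nk + nthreads - 1):
        active = [(p, t - p) for p in range(nthreads) if 0 <= t - p < nk]
        steps.append(active)
    return steps


def check_dependencies(steps):
    """Verify that every task runs strictly after the tasks it depends on."""
    ran_at = {task: t for t, active in enumerate(steps) for task in active}
    for (p, k), t in ran_at.items():
        if p > 0:
            assert ran_at[(p - 1, k)] < t  # data from the neighbouring j-block
        if k > 0:
            assert ran_at[(p, k - 1)] < t  # data from the previous k-plane
    return True
```

Under these assumptions, filling and draining the pipeline costs `nthreads - 1` extra steps, so the idealized efficiency is `nk / (nk + nthreads - 1)`. For example, 8 planes on 4 threads take 11 steps, which gives an ideal efficiency of about 0.73 before any cache or synchronization effects.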

Both hyperplane and pipeline codes are parallelized for OpenMP using the Computer-Aided Parallelizer and
Optimizer (CAPO) parallelization tool. This task would have been very time consuming if performed
manually, particularly because the code requires sophisticated parallelization techniques such as
pipelined thread execution, which is not available via automatic parallelization by the vendor's
commercial compiler. The rapid tool-based parallelization makes it possible to compare different
strategies and to choose the most efficient implementation.

The parallelization is non-trivial, since the implementation gives rise to a number of conservative and
actual data dependencies. CAPO uses the extensive dependency analysis module of the ParaWise system
and, based on the information resulting from the analysis, inserts OpenMP directives into the source code.
The following features of CAPO, which are not available via automatic compiler parallelization, are
essential for the efficiency of the parallel code. CAPO provides an extensive set of browsers to allow user
interaction for improvements of the generated code. This makes it possible to interactively declare the
scope of certain variables as either shared or private, thereby removing conservatively assumed dependencies
that would inhibit parallelization by the compiler. CAPO also optimizes the parallel code by merging the
parallelized loops within a routine into one large parallel region. This reduces the overhead of forking
and joining threads at the beginning and end of parallel loops.


Preliminary Results 

The SGI Origin and Altix shared-memory systems are based on 0.6 GHz RISC and 1.5 GHz Intel Itanium-2
processors, respectively. Timings for the serial and the parallel executions were obtained using the -O2
optimization compiler flag during compilation.

In order to investigate the performance of parallel Gauss-Seidel methods for reacting flow, a scramjet
problem has been calculated as a test case. While the weight of the oxygen tank exceeds thirty percent of
the total weight of the Space Shuttle at launch, only one percent of the total weight is for the payload.
Air-breathing rocket propulsion systems, which consume oxygen in the air, offer clear advantages by
making vehicles lighter and more efficient. Fuel-air mixing and rapid combustion are of crucial importance
for the success of scramjet engines since the spreading rate of the supersonic mixing layer decreases as the
Mach number increases. In our test case, hydrogen fuel is injected transversely into an incoming supersonic flow
of air. The incoming air speed, pressure, and temperature are assumed to be Mach 2, 1 atm, and 1,000 K.
Gaseous hydrogen is injected at sonic speed through a hole at the bottom, whose non-catalytic wall is
cooled to 600 K. The length of the combustion chamber is 40 times the diameter of the injector. The Reynolds
number based on the length is approximately 10^5. A 257 x 257 x 257 structured grid (approximately 17
million points) has been used with symmetric boundary conditions at the top and side walls. Supersonic
flow boundary conditions are imposed at the inlet and outlet planes.

Figure 1 compares the parallel efficiency of the pipeline code on SGI Altix and Origin systems. Both
systems show comparable performance up to 16 processors, while the efficiency of the Origin is better
than that of the Altix on 32 and 64 processors. However, the Altix appears to outperform the Origin on 128
processors since the speedup of the Origin reaches a plateau at 64 processors. Figure 2 shows the relative
speedup of the Altix over the Origin. It is not surprising that the Altix is two to three times faster than the
Origin, considering that the clock speed of the Altix processor is 2.5 times that of the Origin's. What is
interesting is that the best performance of the Altix seems to be at 128 processors. The final manuscript will
include a detailed analysis of the different parallelization methods.
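
The abstract does not state the definitions behind Figures 1 and 2. A reader may assume the conventional ones, sketched below; the symbols $T_m(p)$, $\eta_m(p)$, and $S_{rel}(p)$ are notation introduced here for illustration, with $T_m(p)$ the wall-clock time on $p$ processors of machine $m$.

```latex
% Parallel efficiency on machine m (conventional definition, assumed):
\eta_m(p) = \frac{T_m(1)}{p \, T_m(p)}

% Relative speedup of the Altix over the Origin at equal processor count (assumed):
S_{rel}(p) = \frac{T_{Origin}(p)}{T_{Altix}(p)}

% Clock-rate ratio quoted in the text:
\frac{1.5\ \mathrm{GHz}}{0.6\ \mathrm{GHz}} = 2.5
```

With these definitions, a relative speedup near 2.5 would correspond to equal parallel efficiency and clock-proportional serial performance on both machines. Values above or below that would reflect differences in efficiency or in per-clock serial performance.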

Summary 

Parallelization methods have been implemented for a symmetric Gauss-Seidel relaxation algorithm in
conjunction with a loosely-coupled scheme for chemically reacting non-equilibrium flow. Both hyperplane
and pipeline methods have been implemented in the Real-Gas Aerodynamic Simulator code using OpenMP
directives on a cache-coherent non-uniform memory access architecture. The performance of the parallelization
methods has been demonstrated on SGI Altix and Origin shared-memory systems.

Fig. 1. Parallel Efficiency of Altix and Origin



Fig. 2. Relative Speedup of Altix over Origin 



